docs(coding-agent/edit): document Unicode escape semantics in edit tool prompts #891

Open
apoc wants to merge 1 commit into can1357:main from apoc:docs/edit-tools-unicode-escape-guidance

apoc (Contributor) commented Apr 30, 2026

Background

This replaces #889, which proposed runtime decoding of \uXXXX escape sequences in the edit/write pipeline. The Codex review on that PR (inline comment on normalize.ts:237) correctly identified that runtime decoding is unsound: it makes it impossible to write a single literal 6-char \u2192 source-code sequence to disk, because every viable JSON tool argument either decodes to the character or ends up with extra backslashes.

| LLM JSON tool arg | Parsed string | After runtime decode | On disk |
| --- | --- | --- | --- |
| `"\\u2192"` (natural) | `\u2192` (6 chars) | `→` | `→` |
| `"\\\\u2192"` (escape attempt) | `\\u2192` (7 chars) | unchanged by lookbehind | `\\u2192` |

That breaks JS regex source (/\u2192/), Python raw strings (r"\u2192"), JSON fixtures, and any code that contains \uXXXX literally.
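
To make the table concrete, here is a minimal sketch of a lookbehind-based decoder. The actual #889 implementation is not reproduced in this PR, so `hypotheticalRuntimeDecode` and its regex are assumptions made purely for illustration.

```typescript
// Hypothetical stand-in for the runtime decoder proposed in #889 (assumption:
// it decoded \uXXXX unless the sequence was preceded by another backslash).
// Built with new RegExp so the pattern is only evaluated at runtime.
const ESCAPE_PATTERN = new RegExp(String.raw`(?<!\\)\\u([0-9a-fA-F]{4})`, "g");

function hypotheticalRuntimeDecode(parsed: string): string {
  return parsed.replace(ESCAPE_PATTERN, (_match: string, hex: string) =>
    String.fromCharCode(parseInt(hex, 16)),
  );
}

// Strings as they look AFTER JSON parsing. These are TS string literals,
// so each "\\" below is a single backslash in the actual string.
const naturalArg = "\\u2192"; // 6 chars: \u2192
const escapeAttempt = "\\\\u2192"; // 7 chars: \\u2192

const decodedNatural = hypotheticalRuntimeDecode(naturalArg);
const decodedAttempt = hypotheticalRuntimeDecode(escapeAttempt);

console.log(decodedNatural.length); // 1 (decoded to the arrow character)
console.log(decodedAttempt.length); // 7 (left alone, extra backslash kept)
```

Under this assumed decoder, neither input ends up on disk as the 6-char literal, which is the gap the review pointed out.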

Approach

Document the convention in tool prompts instead of decoding at runtime. JSON natively decodes \uXXXX already, so:

  • For the character →: emit "\u2192" (one backslash) in the JSON, or the literal character. Both arrive at the tool as →.
  • For the literal 6-char \u2192 (regex source, raw strings, fixtures): emit "\\u2192" (two backslashes) so the JSON parser delivers the 6 chars verbatim.
  • NEVER emit "\\u2192" (two backslashes) when you intend the character →. That writes the literal 6-char text, not a Unicode character.
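
The two directions can be checked with plain `JSON.parse`. This is a small sketch, not code from the PR: the `content` field name mirrors the write tool's argument, while `parseContent` and the sample payloads are made up for the example.

```typescript
// Raw JSON text as an LLM might emit it for a tool argument.
// String.raw keeps the backslashes exactly as typed.
const oneBackslash = String.raw`{"content": "\u2192"}`;
const twoBackslashes = String.raw`{"content": "\\u2192"}`;

// Hypothetical helper: parse the tool-call JSON and return the content argument.
function parseContent(rawJson: string): string {
  const parsed = JSON.parse(rawJson) as { content: string };
  return parsed.content;
}

const asCharacter = parseContent(oneBackslash); // the arrow character
const asLiteral = parseContent(twoBackslashes); // backslash, u, 2, 1, 9, 2

console.log(asCharacter.length); // 1
console.log(asLiteral.length); // 6
```

Because JSON parsing already handles both cases, the tool can write whatever string it receives without further decoding.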

Add a <unicode-content> section to four tool prompts spelling out both directions and the negative example:

  • write.md (content)
  • replace.md (old_text / new_text)
  • patch.md (diff + lines and op:create payloads)
  • hashline.md (content)

Files

packages/coding-agent/CHANGELOG.md                  | 3 +++
packages/coding-agent/src/prompts/tools/hashline.md | 7 +++++++
packages/coding-agent/src/prompts/tools/patch.md    | 7 +++++++
packages/coding-agent/src/prompts/tools/replace.md  | 7 +++++++
packages/coding-agent/src/prompts/tools/write.md    | 7 +++++++
5 files changed, 31 insertions(+)

Why this is better than runtime decoding

  • No silent corruption of valid source code. The \u2192 in /\u2192/ written to a .ts file stays as 6 chars; the tool's job is to write what JSON delivered, not to second-guess it.
  • No new failure mode. Runtime decoding meant LLMs that already emitted characters correctly would now have a hidden transformation between their intent and the file.
  • Teaches the convention. Anthropic-style tool descriptions are read attentively by the LLM. A clear "do this, not that" instruction trains future calls; a runtime decoder hides the rule.
  • Reversible. If experience shows LLMs ignore the guidance and persistently double-escape, we can revisit. A documentation change is cheap to evolve.

Verification

  • bun run format-prompts clean (no formatting changes needed)
  • All four prompts use the actual U+2192 character in prose where the character is intended, and the literal 6-char escape sequence in code spans where the literal text is intended (verified via xxd)
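
As a programmatic complement to the xxd check, something like the following could count how often each form appears in a prompt's text. It is only a sketch: `countOccurrences` and the sample string are hypothetical and not part of this PR.

```typescript
// Count non-overlapping occurrences of needle in text.
function countOccurrences(text: string, needle: string): number {
  return text.split(needle).length - 1;
}

// Hypothetical prompt excerpt: one real arrow character in prose and one
// literal 6-char escape in a code span ("\\u2192" is backslash + u2192).
const sample = "Use \u2192 in prose, but write `\\u2192` in code spans.";

const arrowCount = countOccurrences(sample, "\u2192"); // the character U+2192
const literalCount = countOccurrences(sample, "\\u2192"); // the 6-char text

console.log(arrowCount, literalCount); // 1 1
```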

…ol prompts

Tool-call JSON is parsed before any of the file-writing tools sees its
content arguments, so JSON's native `\uXXXX` decoding already covers the
"write a Unicode character" case: emitting `"\u2192"` (one backslash) in
the JSON delivers `→` to the tool. To write the *literal* 6-char source
sequence `\u2192` (e.g. JS regex `/\u2192/`, Python `r"\u2192"`, JSON
fixtures, docs about Unicode), emit `"\\u2192"` (two backslashes) so the
JSON parser delivers the 6 chars verbatim.

Add a `<unicode-content>` section to the `write`, `replace`, `patch`,
and `hashline` tool prompts spelling out both directions and an explicit
"never emit two backslashes when you intend the character" rule. This
prevents the common LLM mistake of double-escaping (which produces the
literal text on disk) without requiring runtime decoding heuristics that
would make literal-escape writing impossible.

can1357 (Owner) commented Apr 30, 2026

This is a bit wasteful in terms of prompt space tbh, although I'm aware of the issue.
I wonder, how CC deals with it? I don't remember seeing this in their prompt.

apoc (Contributor, Author) commented Apr 30, 2026

Agreed. I tried to fix it in tooling, but that had other issues :/

apoc (Contributor, Author) commented Apr 30, 2026

#889

apoc (Contributor, Author) commented May 2, 2026

> This is a bit wasteful in terms of prompt space tbh, although I'm aware of the issue. I wonder, how CC deals with it? I don't remember seeing this in their prompt.

Checked the CC code, and it looks like there is no special handling for this kind of edit there either.
